Secunetics Infrastructure Operations Runbook Addendum

Adding and Provisioning New Models

This micro-runbook describes the standard operating sequence for adding another LLM to the active cluster configuration on the Mac Studio (M4 Max / 64GB Unified RAM). Follow the steps in order so that the model engines already being served are never interrupted and the hardware memory safety limits are respected.

CRITICAL OPERATION STEP: Pre-Flight VRAM Evaluation
Because Apple Silicon shares system RAM dynamically with the GPU, the cluster's safe operating target is capped at 42GB of combined active weights. This leaves a 6GB safety buffer for context memory growth below the ~48GB Metal limit. Before deploying a new model, calculate its allocation footprint and add it to the weights already loaded. The target model selected for this expansion is DeepSeek-Coder-6.7B-Instruct (Q4_K_M GGUF), which consumes roughly 4.0GB and fits within our VRAM headroom.
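The headroom check can be sketched as a small shell function. The 42GB ceiling is taken from the paragraph above; the function name fits_budget is hypothetical, and the figure for weights already loaded is a placeholder that must be replaced with the real total for the models currently running.

```shell
# Minimal sketch: check a candidate model against the 42GB active-weight ceiling.
# Sizes are given in MB so the check needs integer arithmetic only.
BUDGET_MB=43008   # 42GB * 1024, the safe operating target from this runbook

# fits_budget LOADED_MB NEW_MB   (hypothetical helper, not part of any tool)
# Returns 0 if the combined weights stay within BUDGET_MB, 1 otherwise.
fits_budget() {
  loaded_mb=$1
  new_mb=$2
  total_mb=$((loaded_mb + new_mb))
  if [ "$total_mb" -le "$BUDGET_MB" ]; then
    echo "OK: ${total_mb}MB of ${BUDGET_MB}MB"
    return 0
  fi
  echo "OVER BUDGET: ${total_mb}MB of ${BUDGET_MB}MB"
  return 1
}
```

For example, with an assumed 30000MB already loaded, fits_budget 30000 4096 prints "OK: 34096MB of 43008MB" and returns success. The check covers weights only; context memory is expected to come out of the 6GB buffer.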

Step-by-Step Deployment Routine

1 Sourcing and Downloading Model Weights

Activate the isolated Python virtual environment on the Mac Studio and use the Hugging Face CLI (hf download) to pull the validated GGUF weights file:

source ~/local-ai/venv-mlx/bin/activate
hf download Bartowski/deepseek-coder-6.7b-instruct-GGUF deepseek-coder-6.7b-instruct-q4_k_m.gguf --local-dir ~/local-ai/models/gguf
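Before launching an engine on the new file, a quick sanity check can confirm that the download is actually a GGUF file. GGUF files begin with the four ASCII bytes "GGUF"; the helper name is_gguf below is hypothetical. This is only a sketch: it catches an obviously wrong file (for example an error page saved under the .gguf name) but does not verify a checksum or detect truncation.

```shell
# is_gguf FILE   (hypothetical helper)
# Returns 0 if FILE exists and starts with the 4-byte magic "GGUF", 1 otherwise.
is_gguf() {
  [ -f "$1" ] || return 1
  # Read the first four bytes; strip NUL bytes so the shell can hold the value.
  magic=$(head -c 4 "$1" | tr -d '\000')
  [ "$magic" = "GGUF" ]
}
```

Usage: is_gguf ~/local-ai/models/gguf/deepseek-coder-6.7b-instruct-q4_k_m.gguf && echo "weights file looks like GGUF"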

2 Provisioning a New Port Allocation Slot

Our foundational GGUF process uses port 8080, and our MLX process uses port 8081. We allocate port 8082 to this third engine. Launch it persistently in a new, detached tmux session:

tmux new-session -d -s engine-coding '~/local-ai/bin/llama-server -m ~/local-ai/models/gguf/deepseek-coder-6.7b-instruct-q4_k_m.gguf --port 8082 --host 127.0.0.1 -c 8192 -np 1'
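Model loading takes some time after the session starts, so it can help to wait for the engine before touching the gateway. The sketch below is a generic retry loop; the name wait_until_ready and the attempt and delay values are assumptions. It relies on llama-server's /health endpoint, which returns an error status while the model is still loading.

```shell
# wait_until_ready ATTEMPTS DELAY_SECONDS PROBE_COMMAND...   (hypothetical helper)
# Runs PROBE_COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first successful probe, 1 if every attempt fails.
wait_until_ready() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Usage: wait_until_ready 30 2 curl -sf http://127.0.0.1:8082/health && echo "engine-coding is ready". The -f flag makes curl exit non-zero on an HTTP error status, so the loop keeps polling until the server reports healthy.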

3 Updating Gateway Router Matrix Definitions

Open the gateway configuration file (~/local-ai/configs/litellm_config.yaml) in a console editor such as nano. Append the new model block at the end of the existing model_list array:

model_list:
  - model_name: production-deep-context
    litellm_params:
      model: openai/unsloth/gemma-4-31B-it-qat-GGUF
      api_base: http://127.0.0.1:8080/v1
      api_key: "not-needed"
  - model_name: production-ultra-fast
    litellm_params:
      model: openai/mlx-community/Qwen2.5-14B-Instruct-4bit
      api_base: http://127.0.0.1:8081/v1
      api_key: "not-needed"
  # APPEND THE THIRD DEPLOYMENT TARGET PRECISELY HERE:
  - model_name: production-coding-assistant
    litellm_params:
      model: openai/Bartowski/deepseek-coder-6.7b-instruct-GGUF
      api_base: http://127.0.0.1:8082/v1
      api_key: "not-needed"
      tpm: 150000
      rpm: 1500
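A common copy-and-paste mistake when appending a block is leaving the previous entry's api_base in place, which would silently route the new model name to an existing engine. The sketch below lists any api_base value that appears more than once. The helper name duplicate_api_bases is hypothetical, and the check assumes one api_base key per line and that each engine in this setup has its own port.

```shell
# duplicate_api_bases CONFIG_FILE   (hypothetical helper)
# Prints every api_base value that occurs more than once in CONFIG_FILE.
# Prints nothing when all api_base values are unique.
duplicate_api_bases() {
  grep -E '^[[:space:]]*api_base:' "$1" \
    | sed -E 's/^[[:space:]]*api_base:[[:space:]]*//' \
    | sort \
    | uniq -d
}
```

Usage: duplicate_api_bases ~/local-ai/configs/litellm_config.yaml. Empty output means the three entries point at three distinct endpoints.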

4 Cycling the Proxy Runtime Cache

Restart the LiteLLM proxy so that it loads the updated YAML configuration. The model engines keep running in their own sessions throughout; only the gateway itself is briefly unavailable while it restarts:

# Kill the single active proxy engine tracking window
tmux kill-session -t gateway-proxy
# Relaunch the proxy process with security and OpenAPI schema documentation bypass flags intact
tmux new-session -d -s gateway-proxy 'export NO_DOCS=true && export NO_REDOC=true && export NO_OPENAPI=true && source ~/local-ai/venv-mlx/bin/activate && litellm --config ~/local-ai/configs/litellm_config.yaml --port 4000 --host 0.0.0.0'

5 Verification and End-to-End Smoke Testing

First, execute tmux ls to verify that all background sessions (the three engines plus gateway-proxy) are listed. Then, run this validation query from the console to verify that the gateway routes traffic to port 8082:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk_live_mac_studio_master_init_key_2026" \
  -H "Content-Type: application/json" \
  -d '{ "model": "production-coding-assistant", "messages": [{"role": "user", "content": "Write a python function for quicksort."}] }'
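The tmux ls check in this step can also be scripted. The sketch below reads the tmux ls listing on standard input and reports any expected session that is missing. The helper name missing_sessions is hypothetical; only the session names engine-coding and gateway-proxy appear in this runbook, so the names of the two pre-existing engine sessions would need to be added by the operator.

```shell
# missing_sessions NAME...   (hypothetical helper)
# Reads `tmux ls` output on stdin and prints each expected NAME that is absent.
# Prints nothing when every expected session is listed.
missing_sessions() {
  listing=$(cat)
  for name in "$@"; do
    # Each tmux ls line starts with "<session-name>:".
    if ! printf '%s\n' "$listing" | grep -q "^${name}:"; then
      echo "$name"
    fi
  done
}
```

Usage: tmux ls | missing_sessions engine-coding gateway-proxy. A listed session only shows that the process is still alive; the curl query above remains the actual end-to-end test.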

Operational Maintenance & Troubleshooting Logs